OpenAI has released SWE-bench Verified, a benchmark designed to assess AI performance on software engineering tasks more accurately and to address limitations of the original SWE-bench, such as overly strict unit tests, ambiguous problem descriptions, and development environments that are difficult to set up. The new benchmark introduces a containerized Docker environment, improving the consistency and reliability of evaluation. Under the new benchmark, GPT-4o solved 33.2% of the samples, while the best open-source agent framework has...